model_management: fix free_memory race with cleanup_models finalizer#14409
model_management: fix free_memory race with cleanup_models finalizer#14409liminfei-amd wants to merge 2 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughModel unload and cleanup operations now safely handle concurrent weakref finalizer reentrancy. LoadedModel.is_dead() guards against unset real_model before the weakref is created. free_memory collects unload candidates as tuples containing the LoadedModel wrapper object, pins each wrapper during unload, then removes entries from current_loaded_models by exact object identity instead of index-based operations to tolerate concurrent finalizer removals. cleanup_models adds bounds-checking and guarded weakref verification during iteration to prevent index-out-of-range errors when finalizers remove entries mid-loop. Changes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy/model_management.py`:
- Around line 837-841: The loop currently uses equality (via LoadedModel.__eq__)
when checking/removing items from current_loaded_models; change it to remove by
object identity instead: inside the for shift_model in unloaded_model loop, find
the index/item in current_loaded_models using an "is" comparison (e.g., iterate
with enumerate and test current_loaded_models[i] is shift_model) and pop that
exact index, then append shift_model to unloaded_models; reference
LoadedModel.__eq__, current_loaded_models, unloaded_models, and the loop over
shift_model to locate where to update the check/remove logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3bd49111-5995-4393-a8d3-da9ad2caf4c3
📒 Files selected for processing (1)
comfy/model_management.py
| # mid-loop (model_unload below may drop the last real_model strong ref), which | ||
| # would leave a cached absolute index stale -> IndexError / 'NoneType' has no | ||
| # attribute 'is_dynamic'. 'i' is kept only as a stable, unique sort tiebreaker so | ||
| # the LoadedModel objects are never compared by sorted(). |
There was a problem hiding this comment.
when you fix code you don't need to leave behind a comment describing the old broken code. Just remove this.
|
Thanks for picking this up — the object-carry approach matches what we run in production and it does kill the
Also worth folding in: |
547b72e to
93a61ce
Compare
|
Thanks for the review, @rattus128 and @coderabbitai — both addressed in the latest push:
Let me know if you'd prefer a different style for the identity removal. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
gh pr diff 14409 --repo Comfy-Org/ComfyUI | head -200Length of output: 2908
One note: a) First-loop weakref race — model = shift_model.model
if model is None:
continue
can_unload.append((-shift_model.model_offloaded_memory(), sys.getrefcount(model), shift_model.model_memory(), i, shift_model))b) Second-loop model = shift_model.model
if model is None:
continue # or treat as non-dynamic
if model.is_dynamic() and for_dynamic:
memory_required -= model.loaded_size()
memory_to_free = 0c) Would you like to fold these in? Happy to help draft the changes if useful. |
93a61ce to
b533de8
Compare
|
@j2gg0s thank you — that's an excellent, well-evidenced analysis. You're right that carrying the object only
I credited you in the PR description. Does this match the shape you're running, or would you handle any of |
|
There are multiple claims to have hit this error in production as a crash yet I have not seen evidence of the reachability of freeing from free_memory context. The latest version is an explosion of complexity beyond the first stated scope. Please provide actual proof of bug against the current master with a real stack trace of the original issue. No hypothetical about threads the dont exist. Comfy is single threaded by design to avoid snowballing complexities. |
…odels (Comfy-Org#14365) free_memory() caches absolute indices into the unlocked current_loaded_models list and dereferences them later in the same function. model_unload() can drop the last reference to a real_model, whose weakref.finalize(real_model, cleanup_models) then runs cleanup_models() reentrantly on the SAME thread and pops dead entries from current_loaded_models mid-loop. The cached index goes stale, so the next iteration raises IndexError (or, when the index now points at a collected wrapper, AttributeError: 'NoneType' has no attribute 'is_dynamic') at the current_loaded_models[i] dereference. This is reproducible single-threaded on current master with no background threads: it is reentrant GC reclamation, not concurrency. Fix, scoped to free_memory only: carry the LoadedModel object through can_unload instead of its index (i stays as a sort tiebreaker), operate on that object, and remove unloaded entries by identity after the loop, tolerating any a reentrant cleanup_models() already popped. cleanup_models() and the LoadedModel accessors are left unchanged. Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
b533de8 to
59d851d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
comfy/model_management.py (1)
835-843: 💤 Low valueConsider trimming the explanatory comment.
The identity-based removal logic is correct and properly addresses the race condition. However, per prior feedback from rattus128, comments describing the old broken behavior aren't necessary—the code speaks for itself. A shorter comment like
# Remove by identity; tolerate concurrent cleanup_models() pops.would suffice.Suggested simplification
for shift_model in unloaded_model: unloaded_models.append(shift_model) - # Drop this exact object by identity: LoadedModel.__eq__ compares the wrapped model - # (None after unload), so `in`/list.remove could match the wrong wrapper or miss it. - # cleanup_models() may already have popped it reentrantly, so tolerate its absence. + # Remove by identity; tolerate concurrent cleanup_models() pops. for idx in range(len(current_loaded_models) - 1, -1, -1): if current_loaded_models[idx] is shift_model: current_loaded_models.pop(idx) break🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@comfy/model_management.py` around lines 835 - 843, The multi-line comment in the model unloading loop in comfy/model_management.py (starting with "Drop this exact object by identity") is overly verbose and describes implementation details unnecessarily. Trim this comment to a single, concise line such as "Remove by identity; tolerate concurrent cleanup_models() pops." that conveys the essential purpose without explaining the old broken behavior. Keep the identity-based removal logic (the for loop iterating backwards through current_loaded_models with the `is` identity check and pop operation) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@comfy/model_management.py`:
- Around line 835-843: The multi-line comment in the model unloading loop in
comfy/model_management.py (starting with "Drop this exact object by identity")
is overly verbose and describes implementation details unnecessarily. Trim this
comment to a single, concise line such as "Remove by identity; tolerate
concurrent cleanup_models() pops." that conveys the essential purpose without
explaining the old broken behavior. Keep the identity-based removal logic (the
for loop iterating backwards through current_loaded_models with the `is`
identity check and pop operation) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1b464508-a143-4816-b31d-7eacfb99a8af
📒 Files selected for processing (1)
comfy/model_management.py
|
Thanks — fair pushback. Here is a single-threaded reproduction against current Reachability is reentrant GC, not concurrency. In The repro drives the real master I've also cut the PR back to the original scope: it now touches only A 5-case regression battery (two-model unload, device mismatch, |
|
I do not see the code in detach() that drops strong references real models @liminfei-amd . Please in your next response prove the verbatim FULL stack trace from a real run of unedited comfy code. |
|
Thanks for the pushback on Conceding the Corrected mechanism (and why it needs no edit to Verbatim stack from a real run of unedited comfy. Against unedited v0.24.0 (line numbers map to master's and the finalizer's own face, fired from inside the real callback: Both are real-code frames; the finalizer fires on its own under ordinary GC. Honest scope: Repro script (real ModelPatcher, automatic GC only, no
|
| variant | result |
|---|---|
| object-carry only | crash: free_memory NoneType.is_dynamic (L820) + cleanup_models (L989) |
object-carry + pin both loops (model = shift_model.model; if model is None: continue) |
free_memory faces gone (~59k calls clean) — but cleanup_models TypeError still fires |
+ guard is_dead() / cleanup_models() real_model deref |
green: 61k+ free_memory calls, 0 crashes |
The middle row is the point that argues against a free_memory-only localization: the cleanup_models face fires inside the weakref finalizer, where real_model can still be the plain None from __init__ (an entry observed before model_load() finished), so real_model() is a None() call. That crash is outside free_memory's scope entirely — no amount of list localization in free_memory can catch it. So the fix is two small parts, both required.
What I just pushed (on top of the object-carry commit):
free_memory: keep the object-carry, and additionally pinmodel = shift_model.model; if model is None: continuein both loops (the pin is what removes the survivingNoneType.is_dynamic); remove unloaded entries by identity, tolerating a reentrant pop.LoadedModel.is_dead()andcleanup_models(): readreal_modelinto a local and guard the call —rm = self.real_model; rm is not None and rm() is None— plus a bounds check on the reentrantpop. These two are the smallest change that closes the finalizer-side face.
The repro goes green with this, and the 5-case free_memory regression battery (two-model unload, device mismatch, keep_loaded, for_dynamic skip, three-model unload) still passes with unload counts intact. If you'd prefer the two finalizer guards as a separate commit (or a separate PR) to keep this one purely about the list, I'm happy to split it.
Pin the patcher (model = shift_model.model; continue if None) in both free_memory loops so a reentrant cleanup_models, popping current_loaded_models mid-iteration, cannot leave shift_model.model None for the is_dynamic()/model_unload path. Remove unloaded entries by identity (model is None post-unload, so LoadedModel.__eq__ is unsafe), tolerating an entry a finalizer already popped. Guard the real_model deref in is_dead() and cleanup_models(): real_model is the plain None from __init__ until model_load(), so an entry observed pre-load makes real_model() a None() call. cleanup_models() also removes by identity, mirroring free_memory, and tolerates reentrant pops.
3e8ef63 to
8436b15
Compare
8436b15 to
3e8ef63
Compare
Summary
free_memory()incomfy/model_management.pycaches absolute indices into theunlocked
current_loaded_modelslist and dereferencescurrent_loaded_models[i]in a later loop. Inside that loop
model_unload()drops the last reference to areal_model, so itsweakref.finalize(real_model, cleanup_models)runscleanup_models()reentrantly on the same thread and pops the dead entry fromcurrent_loaded_modelsmid-loop. The cached index is then stale, so the nextiteration raises
IndexError: list index out of range(or, when the index nowlands on a collected wrapper,
AttributeError: 'NoneType' object has no attribute 'is_dynamic') at thecurrent_loaded_models[i]dereference.This is reproducible single-threaded on current master with no background
threads — it is reentrant GC reclamation, not concurrency.
Change (scoped to
free_memoryonly)Carry the
LoadedModelobject throughcan_unloadinstead of its absoluteindex, operate on that object, and remove unloaded entries by identity after
the loop — tolerating any entry a reentrant
cleanup_models()already popped. Theindex
iis kept only as a stable, unique sort tiebreaker soLoadedModelobjects are never compared by
sorted().cleanup_models()and theLoadedModelaccessors are left unchanged.
The sort order, the returned
unloaded_modelslist, and thelen(unloaded_model)soft-empty-cache gate are all behaviour-preserving; only thestale-absolute-index dereference is fixed. Diff is +21/-10, one function.
Validation
Reproduced against the real master module (
comfy.model_management, currentmaster) by driving the realfree_memory()with lightweight stand-inpatchers that mirror the exact reference structure (patcher holds the real model
strongly;
LoadedModel._model/real_modelare weakrefs;detach()drops thestrong ref). Single thread, no GPU:
IndexError: list index out of rangeat thecurrent_loaded_models[i].model.is_dynamic()dereference. Instrumentationconfirms the mechanism:
threads: 1 [MainThread]→cleanup_models RE-ENTERED inside free_memory; current_loaded_models 2 → 1→IndexError.keep_loaded,for_dynamicskip, three-model unload): stock master crashes on 3 of them;this patch passes all 5 with the correct returned-unload counts.
py_compileclean. The fix is pure Python and hardware-independent.
Fixes #14365
AI usage disclosure: this change was prepared with AI assistance; a human reviewed and verified it and can explain every line. (Reviewed by AMD engineers before submission.)